Lab 3 - Extending Logistic Regression

Preparation and Overview

Let's open and visualize our data:

Question: Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results. For example, would the model be deployed or used mostly for offline analysis?

Question: Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

Pre-processing:

As explained in the first lab, several categorical attributes need to have their values converted to integers (via binary or one-hot encoding). In addition, some floats must be converted to integers (features whose entries should be integers but are stored as floats), and the smoking attribute must be removed.

With respect to missing or duplicated data, no imputation is needed, and we will not delete the few duplicated entries. Finally, we will also rename the attributes in order to simplify them.

Before we proceed, let's take a look at how the obesity levels are distributed in our population.
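A minimal sketch of how the class distribution could be inspected; the toy dataframe below is a hypothetical stand-in for the obesity dataset loaded earlier in the notebook:

```python
import pandas as pd

# Hypothetical stand-in for the lab's dataframe; the real notebook
# works with the full obesity dataset.
df = pd.DataFrame({"obesity": ["Normal_Weight", "Obesity_Type_I",
                               "Normal_Weight", "Overweight_Level_I"]})

# Class distribution as counts and as proportions.
counts = df["obesity"].value_counts()
proportions = df["obesity"].value_counts(normalize=True)
print(counts)
```

A bar plot of `counts` (e.g. `counts.plot.bar()`) makes any class imbalance visible at a glance.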

We inverted the ranges of the vegetable-consumption and physical-activity attributes: a value of 0 now means the highest frequency, while 2 means the lowest.
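Inverting a 0..2 ordinal scale is a one-line transformation; a small sketch with a toy column (the real notebook applies this to the vegetables and physical-activity columns):

```python
import pandas as pd

# Toy column on the original 0..2 scale (0 = lowest frequency).
vegetables = pd.Series([0, 1, 2, 2])

# After inversion, 0 = highest frequency and 2 = lowest,
# matching the convention described above.
vegetables_inv = 2 - vegetables
print(vegetables_inv.tolist())
```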

  1. Gender - binary
  2. Age - integer
  3. Height - interval (float)
  4. Weight - interval (float)
  5. family - binary
  6. caloric_food - binary
  7. vegetables - ordinal
  8. n_meal - ordinal
  9. eat_bet_meal - ordinal
  10. water - ordinal
  11. monitor_cal - binary
  12. physical_active - ordinal
  13. screen_time - ordinal
  14. alcohol - ordinal
  15. transport - one hot encoding
  16. obesity - ordinal
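The conversions described above can be sketched as follows. The raw column names (`family_history_with_overweight`, `MTRANS`, `SMOKE`) are assumptions based on the standard obesity-levels dataset; the toy frame only mimics a few columns:

```python
import pandas as pd

# Minimal toy frame mimicking a few of the raw columns.
raw = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Age": [21.0, 23.7],                # floats that should be integers
    "family_history_with_overweight": ["yes", "no"],
    "MTRANS": ["Walking", "Automobile"],
    "SMOKE": ["no", "yes"],             # attribute to be removed
})

df = raw.drop(columns=["SMOKE"])                      # remove smoking
df = df.rename(columns={"family_history_with_overweight": "family",
                        "MTRANS": "transport"})       # simplify names
df["Gender"] = (df["Gender"] == "Male").astype(int)   # binary encoding
df["family"] = (df["family"] == "yes").astype(int)
df["Age"] = df["Age"].round().astype(int)             # floats -> integers
df = pd.get_dummies(df, columns=["transport"])        # one-hot encoding
print(df.columns.tolist())
```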

Dimensionality Reduction with PCA:

Now, we will apply PCA in order to check whether any pairs of attributes co-vary with one another:

Weight and Height do not correlate much with Age. However, there seems to be a correlation between Height and Weight. We will try PCA:

The PCA vector shows that we could simply remove the Height attribute without needing PCA at all. However, we will use this vector to project Weight and Height.

That projection is given by:

Let's look at the explained variance, in order to check how much of the original data's variance is captured by the PCA vector:

As expected, the dimensionality reduction retained over 99% of the variance in the original data.
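The projection and the explained-variance check can be sketched as below. The Height/Weight pair is synthetic here, generated so that Weight dominates the joint variance, which mirrors why a single component suffices:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic Height/Weight pair with a strong linear relationship,
# standing in for the real columns.
rng = np.random.default_rng(0)
height = rng.normal(1.70, 0.1, 500)
weight = 60 + 45 * (height - 1.70) + rng.normal(0, 1.0, 500)
X = np.column_stack([height, weight])

pca = PCA(n_components=1)
X_proj = pca.fit_transform(X)     # 2-D (Height, Weight) -> 1-D projection

# Fraction of the original variance captured by the single component.
print(pca.explained_variance_ratio_[0])
```

Because Weight (in kg) has a far larger scale than Height (in m), the first component is dominated by Weight, consistent with the observation that Height could be dropped outright.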

Question: Divide your data into training and testing data using an 80% training and 20% testing split. Use the cross validation modules that are part of scikit-learn. Argue "for" or "against" splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?

The Age and Weight attributes have wide ranges, so we will normalize them before proceeding.
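A minimal normalization sketch, assuming min-max scaling to [0, 1] (the toy Age/Weight values below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy Age/Weight columns; their wide ranges motivate scaling
# before gradient-based optimization.
X = np.array([[14.0, 40.0],
              [27.0, 80.0],
              [61.0, 160.0]])

scaler = MinMaxScaler()          # maps each column onto [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```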

Now, we will validate it using sklearn.

Apparently, our model achieves about 80% accuracy (using the sklearn modules). The result seems suitable, but before drawing any conclusions, let's use numpy and check its accuracy.
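The 80/20 split and the sklearn validation can be sketched as follows; the data here are synthetic stand-ins for the prepared dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic two-class stand-in for the prepared dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80/20 split; stratify keeps the class proportions in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression().fit(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)

# 5-fold cross-validation on the training data as a sanity check.
cv_scores = cross_val_score(LogisticRegression(), X_tr, y_tr, cv=5)
print(test_acc, cv_scores.mean())
```

Stratification matters here because the obesity classes are not perfectly balanced; without it, a 20% test split could over- or under-represent the rarer levels.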

Modeling

Question: Create a custom, one-versus-all logistic regression classifier using numpy and scipy to optimize. Use object oriented conventions identical to scikit-learn. You should start with the template developed by the instructor in the course. You should add the following functionality to the logistic regression classifier:

Let's add the classes for each optimization technique:
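A minimal sketch of the one-versus-all classifier skeleton, assuming scipy's generic `minimize` as the optimizer (the lab swaps in different optimization techniques here; the class and method names below are illustrative, not the instructor's template):

```python
import numpy as np
from scipy.optimize import minimize

class OneVsAllLogisticRegression:
    """Sketch of one-vs-all logistic regression following scikit-learn's
    fit/predict conventions; C is the inverse regularization strength."""

    def __init__(self, C=1.0):
        self.C = C

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def _objective(self, w, Xb, yb):
        # Negative log-likelihood plus an L2 penalty (bias excluded).
        p = self._sigmoid(Xb @ w)
        eps = 1e-12
        nll = -np.sum(yb * np.log(p + eps) + (1 - yb) * np.log(1 - p + eps))
        return nll + 0.5 / self.C * np.sum(w[1:] ** 2)

    def fit(self, X, y):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias column
        self.classes_ = np.unique(y)
        self.coef_ = np.zeros((len(self.classes_), Xb.shape[1]))
        for i, c in enumerate(self.classes_):   # one binary problem per class
            yb = (y == c).astype(float)
            res = minimize(self._objective, np.zeros(Xb.shape[1]),
                           args=(Xb, yb), method="BFGS")
            self.coef_[i] = res.x
        return self

    def predict(self, X):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        scores = self._sigmoid(Xb @ self.coef_.T)   # one score per class
        return self.classes_[np.argmax(scores, axis=1)]
```

Each optimization technique (steepest descent, stochastic gradient, BFGS, etc.) would plug into the loop in `fit`, either as a different `method` argument or as a hand-rolled update rule.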

Question: Train your classifier to achieve good generalization performance. That is, adjust the optimization technique and the value of the regularization term(s) "C" to achieve the best performance on your test set. Visualize the performance of the classifier versus the parameters you investigated. Is your method of selecting parameters justified? That is, do you think there is any "data snooping" involved with this method of selecting parameters?
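A sketch of the C sweep, with synthetic stand-in data. Note that selecting C by repeatedly scoring the test set is itself a form of data snooping; using a held-out validation split or cross-validation for the sweep avoids it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Sweep the regularization strength and record accuracy; in the lab
# this curve would be plotted (accuracy vs. log10(C)).
Cs = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
accs = [LogisticRegression(C=C).fit(X_tr, y_tr).score(X_te, y_te)
        for C in Cs]
print(dict(zip(Cs, accs)))
```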

Question: Compare the performance of your "best" logistic regression optimization procedure to the procedure used in scikit-learn. Visualize the performance differences in terms of training time and classification performance. Discuss the results.
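One side of the comparison can be sketched as below: time a fit with `time.perf_counter` and record accuracy; the same timing wrapper would be applied to the custom classifier so both implementations are measured identically. The data are synthetic:

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] > 0).astype(int)

# Measure training time and in-sample accuracy for the sklearn model.
t0 = time.perf_counter()
clf = LogisticRegression().fit(X, y)
elapsed = time.perf_counter() - t0
acc = clf.score(X, y)
print(f"sklearn: {elapsed:.4f}s, accuracy {acc:.3f}")
```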

Deployment

Question: Which implementation of logistic regression would you advise be used in a deployed machine learning model, your implementation or scikit-learn (or other third party)? Why?

Exceptional Work

Question: Choose ONE of the following: